Estimation of the density of regression errors is a fundamental issue
in regression analysis and it is typically explored via a parametric
approach. This article uses a nonparametric approach with the mean 
integrated squared error (MISE) criterion. It solves a long-standing 
problem, formulated two decades ago by Mark Pinsker, about estima- 
tion of a nonparametric error density in a nonparametric regression 
setting with the accuracy of an oracle that knows the underlying re- 
gression errors. The solution implies that, under a mild assumption 
on the differentiability of the design density and regression function, 
the MISE of a data-driven error density estimator attains minimax 
rates and sharp constants known for the case of directly observed 
regression errors. The result holds for error densities with finite and 
infinite supports. Some extensions of this result for more general het- 
eroscedastic models with possibly dependent errors and predictors 
are also obtained; in the latter case the marginal error density is 
estimated. In all considered cases a blockwise-shrinking Efromovich- 
Pinsker density estimate, based on plugged-in residuals, is used. The 
obtained results imply a theoretical justification of a customary prac- 
tice in applied regression analysis to consider residuals as proxies for 
underlying regression errors. Numerical and real examples are pre- 
sented and discussed, and the S-PLUS software is available. 



1. Introduction. A residual analysis is a standard part of any regres- 
sion analysis, and it involves estimation and/or testing of a regression error 
distribution. This article is devoted to the error density estimation. Let us 
present the problem, its motivation and possible applications via a classical 
homoscedastic model, and then more complicated models will be introduced. 
Following Fan and Gijbels [22], Hart [28] and Eubank [21], suppose that
the statistician observes n independent and identically distributed (i.i.d.)
realizations $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the pair $(X, Y)$ of random variables.

Then the regression problem is to find an underlying regression function
$m(x) := E(Y \mid X = x)$ under the assumption that

(1.1) $Y_l = m(X_l) + \xi_l$, $l = 1, \ldots, n$,

$X_1, \ldots, X_n$ are i.i.d. predictors that are uniformly distributed on [0, 1], and
$\xi_1, \ldots, \xi_n$ are i.i.d. regression errors that are also independent of the
corresponding predictors $X_1, \ldots, X_n$. The model (1.1) is called a homoscedastic
regression model with regression errors which are i.i.d. and independent
of the predictors. If $\hat m(x)$ is a regression estimate, then $\hat\xi_l := Y_l - \hat m(X_l)$,
$l = 1, \ldots, n$, are called residuals. Patterns in the residuals are used to validate
or reject an assumed model. If the model (1.1) is validated, then the next
classical step is to look at the distribution of the regression error $\xi$. Because
realizations $\xi_1, \ldots, \xi_n$ of regression errors are unavailable to the statistician,
residuals are traditionally utilized as their proxies. They may be used either 
for testing a hypothesis about the underlying error distributions or for es- 
timation/visualization of the error density; see a discussion in the classical 
text by Neter, Kutner, Nachtsheim and Wasserman [32]. 
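
To make the notions of model (1.1) and of residuals concrete, here is a minimal Python sketch that simulates data from (1.1) and computes residuals. The sine regression function, the Gaussian errors and the local-average smoother are assumptions chosen only for illustration; they are not the estimator studied in this article.

```python
import math
import random

def simulate_model_1_1(n, m, error_sampler, rng):
    """Draw (X_l, Y_l), l = 1..n, from Y_l = m(X_l) + xi_l with X_l ~ Uniform[0, 1]."""
    xs = [rng.random() for _ in range(n)]
    errors = [error_sampler(rng) for _ in range(n)]
    ys = [m(x) + e for x, e in zip(xs, errors)]
    return xs, ys, errors

def local_average(xs, ys, x0, bandwidth):
    """Hypothetical regression estimate: mean of Y_l over |X_l - x0| <= bandwidth."""
    window = [y for x, y in zip(xs, ys) if abs(x - x0) <= bandwidth]
    return sum(window) / len(window)

def residuals(xs, ys, bandwidth=0.1):
    """Residuals Y_l - m_hat(X_l); each window contains at least the point itself."""
    return [y - local_average(xs, ys, x, bandwidth) for x, y in zip(xs, ys)]

rng = random.Random(0)
m = lambda x: math.sin(2 * math.pi * x)  # assumed regression function
xs, ys, errors = simulate_model_1_1(200, m, lambda r: r.gauss(0.0, 0.5), rng)
res = residuals(xs, ys)  # proxies for the unobserved `errors`
```

In a simulation the underlying errors are available, so `res` can be compared with `errors` directly; with real data only `res` is observed.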

Surprisingly, despite the widespread use of residuals as proxies for unob- 
served errors, to the best of the author's knowledge, no result about optimal 
(in any sense) estimation of a nonparametric error density is known. For 
parametric settings, there exists a recently created Bayesian theory of esti- 
mation, and for nonparametric settings, a theory of consistent estimation is 
developed; the interested reader can find a discussion and further references 
in [8] and [27]. At the same time, there exists a vast literature devoted to den- 
sity estimation based on direct observations and to estimation of functionals 
of the error density; see [2, 14, 34, 37] and [31] where further references can 
be found. 

It is not difficult to understand why the literature on nonparametric er- 
ror density estimation is practically next to none: the problem is extremely 
complicated due to its indirect nature. In a nonparametric setting, the differ- 
ence between any regression estimate and an underlying regression function 
contains a random term and a bias. The bad news is that additive measure- 
ment errors may dramatically slow down optimal rates of density estimation; 
see [13, 14]. The good news is that, of course, additive errors in residuals 
become smaller as the sample size increases, and, thus, optimal rates may be 
preserved. This article shows that, fortunately for applied statistics, the good 
news prevails under the customary assumption that the regression function 
is differentiable and the error density is twice differentiable. 

It is well known in the nonparametric density estimation literature that 
rates alone are of little interest for practically important cases of small 
datasets, and that rates should be studied together with constants; see the
discussion in [30] and [14, 15]. Also, superefficiency and mimicking of ora- 
cles are important issues; see the discussion in [4, 5] and [14], Chapter 7. 
To explore all these issues simultaneously, it is convenient to employ an 
oracle approach suggested by Mark Pinsker more than two decades ago. 
Namely, suppose that an oracle (which will be referred to as a Pinsker or- 
acle and a particular one is defined in Appendix B) knows the underlying 
regression errors $\{\xi_l, l = 1, \ldots, n\}$ and the oracle possesses a bouquet of desired
statistical properties like sharp minimaxity, superefficiency, matching
more powerful oracles that know an estimated error density, and so on. 
Then, if the statistician can suggest a data-driven error density estimate 
that matches the Pinsker oracle, this estimator simultaneously solves all the 
above-formulated problems. Moreover, Pinsker conjectured that a plug-in 
Pinsker oracle, based on residuals, may be the wished data-driven estimator. 
This article proves this long-standing Pinsker conjecture and, as a particu- 
lar corollary, establishes minimax rates and constants of the error density 
estimation. 

There are many practical applications of the error density estimation. Let 
us mention a few that will guide us in this article. (i) Well-known classical
applications are data interpretation, inference, decision making, hypothesis
testing, the diagnostics of residuals, model validation and, if necessary,
model adjustment in terms of the error distribution. (ii) Another classical
application, which is actually the main aim of any regression analysis, is the 
prediction of a new observation where the error density plays the pivotal 
role; see [32], Section 2.5. (iii) Goodness-of-fit tests are another natural ap- 
plication; see [2] and [28]. (iv) The error density is used in a sharp minimax 
regression estimation; see [12]. (v) The error density can be used in statis- 
tical quality control and classification; see [16], as well as a discussion in 
Section 2. 

The model (1.1) with a uniformly distributed predictor is definitely the 
most frequently studied in the regression literature, but a regression anal- 
ysis may reveal patterns that contradict this simple model. For instance, 
predictors may not be uniform and/or the errors may have different vari- 
ances. In this case either some remedial measures like a data transformation 
and/or weighting are applied (these remedies are not discussed here and 
the interested reader is referred to the books by Carroll and Ruppert [6] or 
Neter, Kutner, Nachtsheim and Wasserman [32]), or model (1.1) with an unknown
design density $p(x)$ is considered, or a more general heteroscedastic
regression model is considered:

(1.2) $Y_l = m(X_l) + \sigma(X_l)\xi_l$, $l = 1, \ldots, n$,

where $\sigma(x)$ is a (positive) scale function, the errors $\{\xi_1, \ldots, \xi_n\}$ are i.i.d. with
zero mean, unit variance and independent of the corresponding predictors,
and the predictors are i.i.d. according to an unknown design density $p(x)$
supported on [0,1]. Following Pinsker's paradigm, a data-driven error density
estimator should be compared with an oracle that knows the underlying
errors $\xi_1, \ldots, \xi_n$. Here it is natural to use rescaled residuals as proxies for
unobserved errors, and an implementation of this path implies estimation of 
both regression and scale functions and then dealing with additive and mul- 
tiplicative measurement errors. It will be shown that, under the assumption 
of the differentiability of each nuisance function and a known finite support 
of an estimated error density, the plug-in Pinsker oracle still matches the 
Pinsker oracle; the case of errors with infinite support is an open problem. 

Now we are in a position to consider another assumption about models
(1.1)-(1.2) that needs to be addressed: independence between regression er- 
rors and predictors. There are many known examples where this assumption 
does not hold; see particular ones in Section 2. Another customary situa- 
tion is where an underlying model is heteroscedastic, but the statistician 
assumes/believes that it is homoscedastic; an interesting particular example 
is presented in [28], pages 257-258. If we simply ignore a possible dependence
between $X$ and $\xi$, then what does our plug-in estimate exhibit or, in
other words, what do residuals proxy? To the best knowledge of the author, 
there is no nonparametric literature devoted to this issue. This article shows 
that in this case the marginal error density is estimated and then all the 
above-discussed statistical results hold. In particular, this establishes that a 
plug-in estimation is robust toward a possible dependency between predic- 
tor and regression error, and this is an important conclusion for an applied 
residual analysis. 

Finally, let us note that the developed theory of plug-in estimation signif- 
icantly simplifies the problem of creating software because known statistical 
programs can be used directly. This article uses the S-PLUS software pack- 
age of Efromovich [14] which is available on request from the author. 

The structure of this article is as follows. Section 2 presents several nu- 
merical simulations and real practical examples that should help the reader 
to understand the problem, its solution and possible applications. Section 3 
contains mathematical results, and discussion is presented in Section 4. Ap- 
pendix A describes the main steps of proofs; complete proofs can be found 
in [16, 17, 19]. Appendix B is devoted to the Pinsker oracle, and it presents 
new results for the case of densities with infinite support. 

2. Several examples. Let us explain the above-described problem of error 
density estimation via several particular examples. 

Figure 1 presents a simulation conducted according to model (1.2) with 
functions described in the caption. The left-top diagram exhibits a scatter- 
gram, and the problem is to estimate an underlying error density. Asymp- 
totic theory, presented in Section 3, indicates that the S-PLUS software 
Fig. 1. Simulated example of heteroscedastic regression (1.2) with the regression function
being the Normal, the scale function being the Monotone, the design density being the 
Uniform and the error density being the Bimodal; these underlying functions are defined 
in [14], page 18. The simulated scattergram is shown by triangles, the sample size n = 50 
is shown in the subtitle of the right-bottom diagram. The dotted lines show data-driven 
estimates, the solid lines show underlying functions, and the dashed line in the right-bottom 
diagram shows the oracle estimate based on underlying errors exhibited in the right-top 
diagram. 



package of Efromovich [14] can be used for calculating rescaled residuals and 
then error density estimation. Recall that the package supports Efromovich- 
Pinsker (EP) adaptive series estimates; see the discussion of regression, scale 
and density estimates in [14], Sections 4.2, 4.3 and 3.1. Let us explain how 
this software performs for the simulated dataset. The scattergram is overlaid 
by the EP regression estimate (the dotted line) and it can be compared with 
the underlying regression $m(x)$ (the solid line). This particular estimate is
not perfect and we can expect relatively large additive measurement errors
in the residuals. The left-middle diagram exhibits the EP scale estimate 
(the dotted line), and it can be compared with the underlying scale function
$\sigma(x)$ (the solid line). This estimate is also not perfect, so we can expect
multiplicative measurement errors in the rescaled residuals shown in the
left-bottom diagram. The right column of diagrams exhibits the process of the
error density estimation by the Pinsker oracle and the corresponding plug-in 
estimation. The Pinsker oracle is based on unobserved errors shown in the 
right-top diagram, and the plug-in estimate is based on rescaled residuals 
shown in the right-middle diagram. The oracle, the plug-in estimate and the 
underlying error density are shown in the right-bottom diagram by dashed, 
dotted and solid lines, respectively. 

As we see, due to the presence of measurement errors, the data-driven 
estimate performs worse than the oracle. The estimate is overspread, and 
this outcome is typical for data contaminated by measurement errors; see 
the discussion in [13] and [14], Section 3.5. Nonetheless, the estimate cor- 
rectly indicates the bimodal nature of the error. Keeping in mind that any 
nonparametric analysis is considered as a first glance at the data, the esti- 
mate readily indicates that the error density is not normal. This conclusion 
implies that classical methods of regression analysis, based on normal dis- 
tribution of errors, should be modified. For instance, a prediction error may 
be described by using the error density estimate. 

Let us complement this single simulation with an intensive Monte Carlo 
study where 500 identical simulations are conducted for each sample size 
from the set {25,50,75,100,150,200}. For each simulation, we calculate 
the ratio of ISEs of the Pinsker oracle and the estimate, and then for 
500 simulations, calculate the sample mean, sample median and sample 
standard deviation of the ratios. The corresponding results are as follows: 
{(1.05/0.93/0.74); (1.01/0.83/0.72); (0.96/0.81/0.64); (0.97/0.85/0.63); (0.94/ 
0.88/0.53); (0.96/0.87/0.56)}, where an element (A/B/C) presents the sample
mean, median and standard deviation for a corresponding sample size.
Note that a mean ratio or median ratio smaller than 1 favors the Pinsker 
oracle. As we see, for the explored sample sizes, traditionally considered as 
small even for the case of direct observations, plug-in estimation performs
respectably well. This tells us that Pinsker's proposal of comparing a data-driven
estimator with an oracle is feasible even for the smallest samples. The
interested reader can find more simulations and numerical studies in [16]. 
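
The summary statistics used in this Monte Carlo study can be sketched in Python as follows. The function names and the midpoint-rule integration are assumptions of this sketch; the density estimators themselves (the Pinsker oracle and the plug-in estimate) are taken as given callables and are not reproduced here.

```python
import statistics

def ise(estimate, density, a, b, grid_size=1000):
    """Integrated squared error of `estimate` against `density` over [a, b] (midpoint rule)."""
    h = (b - a) / grid_size
    total = 0.0
    for i in range(grid_size):
        u = a + (i + 0.5) * h
        total += (estimate(u) - density(u)) ** 2
    return total * h

def summarize_ratios(oracle_ises, estimate_ises):
    """Sample mean, median and standard deviation of ISE(oracle) / ISE(estimate).

    A mean or median ratio smaller than 1 favors the oracle.
    """
    ratios = [o / e for o, e in zip(oracle_ises, estimate_ises)]
    return (statistics.mean(ratios), statistics.median(ratios), statistics.stdev(ratios))
```

For each sample size one would call `ise` twice per simulation (once for the oracle, once for the plug-in estimate) and pass the two lists of 500 values to `summarize_ratios`.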

Our next simulation, exhibited in Figure 2, addresses an important issue 
of rescaling of residuals. It is fair to say that an applied regression analysis 
is primarily devoted to a homoscedastic regression, and a possible issue of 
heteroscedasticity is addressed by a data transformation and/or weighting 
rather than rescaling; see the discussions in [6] and [32]. We shall comment 
on this shortly, but now let us consider an example of a homoscedastic regres- 
sion (1.1) which is treated by the suggested software that always attempts to 
rescale residuals. A simulated scattergram is shown in the left-top diagram of 
Figure 2. Absolute values of residuals are shown by squares in the left-middle 
diagram, and they readily exhibit heteroscedasticity. We know that this het- 
eroscedasticity is stochastic in nature (look at the underlying horizontal scale 
function shown by the solid line), but the software does not know this. Thus,
it is of interest to understand how the software will perform with respect to 
a new oracle that knows that the model is homoscedastic. In other words, 
let us compare performances of the same density estimator where scaled 
residuals and residuals are plugged in. The left-bottom and right-middle 
diagrams exhibit by dots and squares rescaled residuals and residuals, re- 
spectively. The corresponding density estimates are shown by the dotted and 
long-dashed lines in the right-bottom diagram; the solid and dashed lines in 
this diagram exhibit the underlying normal error density and the Pinsker 
oracle (based on unobserved regression errors), respectively. As we see, in 
this particular case the rescaling had a positive effect on the estimation. In 
general, this need not be the case, so let us conduct a numerical study identical
to the above-described one with the only difference being that now we are 
comparing the use of rescaled residuals (the estimate) and residuals (a new 
oracle). The results are the following: {(0.99/0.91/0.78); (0.97/0.78/0.57); 
(1.02/0.78/0.65); (0.93/0.87/0.61); (0.98/0.88/0.57); (0.97/0.87/0.54)}. The 
study indicates that rescaling can be considered a robust procedure for ho- 
moscedastic regression, and Section 3 presents asymptotic justification of 
this empirical observation. 

The main purpose of the next simulation is to allow us to discuss the 
case of error depending on the predictor, and it also allows us to explore 
possible applications for statistical quality control. Assume that a process 
is inspected at ordered times $X_l$ and corresponding observations are $Y_l$,
$l = 1, \ldots, n$. Recall that it is customary to say that a process is in control
if its mean (centerline, regression function) and standard deviation (scale 
function, volatility) are constant. Keeping in mind that a traditionally as- 
sumed distribution of controlled variables is Gaussian, the latter implies a 
stationary distribution of the process; see the discussion in [10], Chapter 23. 
The two top diagrams in Figure 3 present a simulated process together with 
its two main characteristics. Because mean and standard deviation of the 
process are constant, the process is declared to be in control. However, even 
if the process is in control, it may not be strictly stationary. Thus, let us 
continue our analysis of the process. The third diagram shows us the es- 
timated marginal density of residuals (the dotted line), which exhibits a 
non-Gaussian shape (note that the underlying marginal density is shown by 
the solid line). If it is known that the process must be Gaussian, this error 
density raises a red flag. If no action is required, as in the familiar "normal
tool wear" example, then modified acceptance charts and hypotheses tests, 
based on the estimated density, should be suggested; see [10], Chapters 23 
and 25. To check the drawn conclusion about the changed distribution, the 
two bottom diagrams exhibit an onset error density for the first 50 observations
and a sunset error density for the last 50 observations. They support



S. EFROMOVICH 
[Figure 2 panels: SCATTERGRAM AND REGRESSION FUNCTION; UNDERLYING ERRORS; ESTIMATED ERRORS; RESIDUALS; DENSITIES; n = 50]



Fig. 2. The use of rescaled residuals and residuals in a homoscedastic regression. The 
structure of the diagrams is similar to Figure 1 with the following modification. Rectangles 
in the left-middle diagram show absolute values of residuals. Rectangles and dots in the 
left-bottom diagram and the right-middle diagram exhibit residuals and rescaled residuals, 
respectively. The long-dashed line in the right-bottom diagram exhibits the estimate based
on residuals. 



our preliminary conclusion that the error distribution is changing. This ex- 
ample shows that nonparametric error density analysis can be a valuable 
addition to classical quality control methods. 

Now we are in a position to explore several real practical examples. The 
research division of BIFAR, a company with businesses in equipment and 
chemicals for wastewater treatment plants, has studied performance of a 
centrifuge for mechanical dewatering of a sludge produced by a food pro- 
cessing plant. The aim of the study has been to understand how a sludge, 
containing a fat waste, can be centrifuged. The top-left diagram in Figure 4 
presents results of a particular experiment. Index of fat is the predictor and 
index of centrifuging is the response. It has been established in [16] that the 
distribution of regression errors crucially depends on the predictor. Thus, 
we know a priori that we will visualize the marginal error density. 
Fig. 3. Simulated example with the error distribution depending on the predictor. Here
$Y = 2 + \varepsilon(X)$ where the error is a linear mixture of the Bimodal density, shown in Figure 1
and having weight $X$, and the Normal density, shown in Figure 2 and having weight
$1 - X$. The structure of the two top diagrams is similar to the left ones in Figure 2. The
third diagram exhibits the estimated and underlying marginal densities. The two bottom 
diagrams show the marginal error density estimates for initial (onset) and final (sunset) 
50 observations. The estimates and underlying densities are shown by dotted and solid 
lines, respectively. 



Before discussion of the example, let us make the following remark about 
the software. It allows the statistician to estimate error densities with a 
known manually chosen finite support or infinite/unknown support. Inten- 
sive simulations in [16] show that, for smaller sample sizes, the former ap- 
proach benefits the estimation, while, for larger samples, both methods per- 
form similarly. In the simulated examples support has been unknown and, 
thus, the shown estimates are completely data-driven. A manual choice of 
support is not a difficult step in many applied settings because it is de- 
fined by specifications. In particular, for the BIFAR example, this approach 
implied the manual choice $[-2.75, 2.75]$ for the support. Due to the small
sample size n = 47, this help is valuable and should be utilized whenever 
Fig. 4. Centrifuging a food processing plant's waste. The sample size is n = 47. The
scattergram is shown by triangles and it is overlaid by the EP regression estimate. 



possible. (The interested reader can find discussion of several manual and 
data-driven choices of support in [14], Chapter 3.) 

Now let us explore the BIFAR dataset. The left-top diagram in Figure 4 
exhibits the data and the estimated regression function. It is readily seen 
from this diagram that the regression is heteroscedastic. The bottom-left 
diagram contains the scale estimate, and it supports our visual conclusion 
about the heteroscedasticity. Let us note that neither the regression nor 
the scale estimate has been a surprise for BIFAR. The right-top diagram 
shows rescaled residuals; the diagram indicates that regression and scale 
estimates performed well and no heteroscedasticity can be observed. Also, 
after a closer look at the rescaled residuals, it is possible to note clusters 
in the residuals. This observation is supported by the estimated marginal 
density of errors shown in the right-bottom diagram. The density estimate 
reveals that it is a mixture of two distributions with the larger "left" clus- 
ter having a negative bias which "drags" the index of centrifuging down. 
This was a fantastic insight into the centrifuging process for BIFAR that, 
just for free, gave the company a new tool for the process analysis. As a 
Fig. 5. Centrifuging a food processing plant's waste with added chemical. Sample size
n = 63. 



result, while classical regression analysis traditionally describes a relation- 
ship between two variables by univariate regression and scale functions, it is 
proposed to complement the analysis by an extra univariate function — error 
density. Let us stress that it would be great to complement this analysis 
with a conditional density, but the sample is too small for bivariate function 
estimation. 

Based on this outcome, BIFAR decided to conduct a series of experiments 
where special chemicals were added to the sludge. Figure 5 presents (in the 
same format) results of a particular experiment. Note that the regression 
and scale functions are about the same due to robust performance of the 
centrifuge. On the other hand, the marginal error density indicates that 
the chemical is able to merge together the above-discussed clusters, and it 
also decreases the relative effect of the "left" cluster. This may explain, at 
least partially, the overall increase in the index of centrifuging caused by the 
chemical. Note that all these observations have been based on the analysis 
of univariate functions. Of course, it would be nice to evaluate an underlying 
conditional density, but the dataset is too small for this and we are restricted
to the univariate nonparametric analysis. 

Let us make a comment that connects these two practical examples with 
the simulated example in Figure 3. It is possible to imagine a situation where 
BIFAR observations are a time series where, due to some circumstances, 
index of fat increases. For instance, this may occur if a food processing plant 
illegally dumps its waste into a municipal sewage system. Then the error 
density will be the first indicator of such violation. Also, these two figures 
indicate the possibility of using the error density for classification purposes: 
the chemical is present or not, fat is present or not. In other words, an error 
density is an additional univariate characteristic (in addition to mean and 
scale functions) that may be useful in many settings of industrial statistics. 

We may conclude that, on top of such classical applications in regression
analysis as prediction, model validation, hypothesis testing and optimal
estimation of regression functions, error-density estimation is a valuable and 
feasible data-analysis tool on its own in time series, quality control and industrial
statistics.

3. Optimal estimation of the error density. The aim of this section is 
twofold. First, we would like to establish minimax rates and constants of 
mean integrated squared error (MISE) convergence of error density esti- 
mates in homoscedastic models (1.1) and, if possible, in heteroscedastic 
models (1.2) with errors depending on predictors. Recall that, even for a ho- 
moscedastic model, minimax rates are unknown; see [8]. Second, we would 
like to suggest a data-driven (adaptive) estimator that attains the mini- 
max convergence. Ideally, to support the classical methodology of applied 
regression analysis and to employ available statistical software, such an esti- 
mator should be a known (for direct observations) density estimator based 
on appropriately calculated residuals, that is, it should be a plug-in density 
estimator. 

Two classical models of errors will be studied: models of errors with a 
known finite support $[a, a+b]$ and errors with infinite support $(-\infty, \infty)$.
Recall that we discussed particular examples in Section 2. We need to make 
a comment about the finite interval case. It will be convenient to evaluate 
the density over a fixed interval, and a customary interval is [0, 1]. In models 
(1.1) and (1.2) the error support cannot be [0, 1] because $E\xi = 0$. Thus, we
employ a familiar location-scale transformation, introduce a new random
variable $\varepsilon := (\xi - a)/b$ and then study the equivalent problem of estimation of
the density $f^{\varepsilon}(u)$ of the transformed error $\varepsilon$ instead of the density
$b^{-1} f^{\varepsilon}([u - a]/b)$ of $\xi$. The approach of estimation of a rescaled random variable is
discussed in detail in [14], Chapter 3. From now on we omit the superscript
$\varepsilon$ in the density, denote by $f$ the density of $\varepsilon$, and refer to it as the error
density (of interest). 
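
The location-scale transformation just described can be sketched in a few lines of Python. The function names are hypothetical; the numerical values reuse the support $[-2.75, 2.75]$ chosen for the BIFAR example in Section 2, that is, $a = -2.75$ and $b = 5.5$.

```python
def to_unit_interval(xi, a, b):
    """Map an error xi in [a, a + b] to epsilon = (xi - a) / b in [0, 1]."""
    return (xi - a) / b

def density_of_xi(f_eps, v, a, b):
    """Density of xi at v, recovered from the density f_eps of epsilon on [0, 1]."""
    return f_eps((v - a) / b) / b

a, b = -2.75, 5.5                        # support [-2.75, 2.75]
eps_left = to_unit_interval(-2.75, a, b)   # left end of the support
eps_right = to_unit_interval(2.75, a, b)   # right end of the support
```

Once the density of the transformed error is estimated on [0, 1], `density_of_xi` maps the estimate back to the original scale.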
In what follows, with some obvious abuse of notation, we shall always
present results for finite and infinite supports simultaneously. 

3.1. Model and assumption for the case of finite support. The studied 
regression model is (1.2), where the error may depend on the predictor X. 
Neither the regression function $m$ nor the scale function $\sigma$ is supposed to be
known. The observed predictors $(X_1, \ldots, X_n)$ are i.i.d. according to an unknown
design density $p(x)$ supported on [0, 1]; the regression errors $\xi_1, \ldots, \xi_n$
do not take on values beyond a known finite interval $[a, a+b]$ and may depend
on the corresponding predictors according to an unknown conditional
density $b^{-1}\psi([v - a]/b \mid x)$, $v \in [a, a+b]$; the pairs $(X_1, \xi_1), \ldots, (X_n, \xi_n)$ are
supposed to be independent and identically distributed. The marginal density
of the rescaled errors $\varepsilon_l = [\xi_l - a]/b$ is the object of interest; that is, the
issue is to estimate the density $f(u) = \int_0^1 \psi(u \mid x) p(x)\,dx$, $u \in [0, 1]$, based on
$n$ pairs of observations $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$.

Assumption A. The regression function $m(x)$, the design density $p(x)$
and the scale function $\sigma(x)$ are differentiable and their derivatives are bounded
and integrable on [0, 1]. Also, $\min_{x \in [0,1]} \min(\sigma(x), p(x)) > 0$ and $\int_0^1 p(x)\,dx = 1$.

Assumption B (Finite support). Model (1.2) is considered where the errors
may depend on the predictors. Pairs of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$
are i.i.d. The conditional density $\psi(u \mid x)$ is such that [...] exists, is
bounded and integrable on $[0, 1]^2$, and $\psi(u \mid x) = 0$ for $u \notin (0, 1)$, $x \in [0, 1]$.

Assumption C. For i.i.d. observations (errors) $Z_1, \ldots, Z_r$ from a density
$f(u)$ with unit support [0,1] or infinite support $(-\infty, \infty)$, Appendix B
defines a data-driven density estimate $\hat f_P(u, Z_1^r)$, $Z_1^r := (Z_1, \ldots, Z_r)$. This
estimate, based on underlying errors, is employed as the Pinsker oracle. It 
is assumed that the statistician knows all parameters of this estimate. 

3.2. Model and assumption for the case of infinite support. Due to the 
complexity of the case, the studied model is homoscedastic regression (1.1) 
where the error $\xi$ is independent of the predictor $X$. Neither the regression
function $m$ nor the design density $p$ of the predictors is known. The problem
is to estimate the density $f(u)$ of $\xi$ based on $n$ i.i.d. pairs of observations
$(X_1, Y_1), \ldots, (X_n, Y_n)$. In what follows a reference to the above-formulated
Assumption A means that $\sigma(x) = 1$, $x \in [0, 1]$.

Assumption B (Infinite support). Model (1.1) is considered with the error
being independent of the predictor and pairs of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$
being i.i.d. The error density $f(u)$ is supposed to be square integrable,
that is, $\int f^2(u)\,du < \infty$, and its characteristic function
$h(v) := \int f(u) e^{iuv}\,du$ satisfies $\int_{-\infty}^{\infty} |v^2 h(v)|^2\,dv < \infty$.

3.3. Notational convention. Several sequences in $n$ are used: $b_n := 4 +
\ln\ln(n + 20)$; $n_2 := n - 3n_1$; $n_1$ is the smallest integer larger than $n/b_n$;
$S := S_n$ is the smallest integer larger than $n^{1/3}$. In what follows we always
consider sufficiently large $n$ such that $\min(n_1, n_2) > 4$. $C$s denote generic
positive constants, $o(1) \to 0$ as $n \to \infty$, and integrals are taken over [0, 1] or
$(-\infty, \infty)$, depending on the support considered. Also, $(x)_+ := \max(0, x)$.
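
As a quick numerical illustration of these sequences, the following Python sketch computes them for a given $n$. The function names are hypothetical, and the exponent defining $S_n$ is passed as a parameter whose default value 1/3 is an assumption of this sketch.

```python
import math

def smallest_integer_larger_than(x):
    """Smallest integer strictly larger than x."""
    return math.floor(x) + 1

def sequences(n, cutoff_exponent=1.0 / 3.0):
    """Return (b_n, n_1, n_2, S_n); `cutoff_exponent` is an assumed default."""
    b_n = 4.0 + math.log(math.log(n + 20))
    n1 = smallest_integer_larger_than(n / b_n)
    n2 = n - 3 * n1
    S = smallest_integer_larger_than(n ** cutoff_exponent)
    return b_n, n1, n2, S

b_n, n1, n2, S = sequences(100)
```

For $n = 100$ this gives three subsamples of size $n_1 = 18$ for the nuisance functions and $n_2 = 46$ observations for the error density, so the requirement $\min(n_1, n_2) > 4$ is already met at this sample size.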

3.4. Plugged-in residuals. The aim of this section is to explain a procedure for the calculation of plugged-in residuals. Four different subsamples are used to estimate the design density, the regression function, the scale function and the error density, respectively (the author conjectures that all $n$ observations may be used for estimation of each function and the result will still hold). The first $n_1$ observations are used to estimate the design density $p(x)$, the next $n_1$ observations are used to estimate the regression function $m(x)$, the next $n_1$ observations are used to estimate the scale function $\sigma(x)$, and the last $n_2$ observations are used to estimate the error density of interest $f(u)$. Note that $n_2\ge[1-3(b_n^{-1}+n^{-1})]n$ and, thus, using either $n_2$ or $n$ observations implies the same MISE convergence. The design density estimate $\hat p(x)$ is a truncated cosine series estimate,

$$\hat p(x) := \max\Biggl(b_n^{-1},\ n_1^{-1}\sum_{s=0}^{S}\sum_{l=1}^{n_1}\varphi_s(X_l)\varphi_s(x)\Biggr).$$

The regression estimate $\hat m(x)$ is also a truncated cosine series estimate,

(3.1) $$\hat m(x) := \sum_{s=0}^{S}\hat\kappa_s\varphi_s(x),$$

where

(3.2) $$\hat\kappa_s := n_1^{-1}\sum_{l=n_1+1}^{2n_1} Y_l\,\hat p^{-1}(X_l)\varphi_s(X_l).$$

Under model (1.2), the scale estimate $\hat\sigma(x)$ is also a truncated cosine series estimate,

(3.3) $$\hat\sigma(x) := \min(\max(\tilde\sigma(x), b_n^{-1}), b_n),$$

where $\tilde\sigma(x) := \sqrt{(\tilde\sigma^2(x))_+}$ and $\tilde\sigma^2(x)$ is a regression estimate defined identically to (3.1)-(3.2), where pairs $\{(X_l,Y_l),\ l=n_1+1,\dots,2n_1\}$ are replaced by $\{(X_l,[Y_l-\hat m(X_l)]^2),\ l=2n_1+1,\dots,3n_1\}$.

Then, for finite support (recall that in this case a heteroscedastic model is considered) we define rescaled residuals

(3.4) $$\hat\varepsilon_l := \frac{Y_l-\hat m(X_l)}{b\hat\sigma(X_l)}-\frac{a}{b},\qquad l=n-n_2+1,\dots,n.$$

For infinite support (in this case a homoscedastic model is considered) we define residuals

(3.5) $$\hat\xi_l := Y_l-\hat m(X_l),\qquad l=n-2n_1+1,\dots,n.$$

Now we can use a unified notation for the residuals and underlying errors. Denote by $\hat Z$ a vector $(\hat\varepsilon_{n-n_2+1},\dots,\hat\varepsilon_n)$ or a vector $(\hat\xi_{n-2n_1+1},\dots,\hat\xi_n)$ for finite and infinite support cases, respectively. Similarly, $Z$ denotes a vector of transformed errors $(\varepsilon_1,\dots,\varepsilon_n)$ or a vector of errors $(\xi_1,\dots,\xi_n)$ for finite and infinite support cases, respectively. Note that $Z$ is known to the Pinsker oracle but not to the statistician.
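The split-sample construction of this section can be sketched in a few lines. This is an illustration under stated assumptions, not the author's software: the function names are hypothetical, the cutoff exponent 1/3 and the lower truncation of the design density at $b_n^{-1}$ are assumptions, and in the heteroscedastic branch the residuals are left on the scale of $\xi$ (the additional shift and rescaling by the support constants $a$ and $b$ is omitted).

```python
import math


def phi(s, x):
    """Cosine basis on [0, 1]: phi_0 = 1, phi_s(x) = sqrt(2) cos(pi s x)."""
    return 1.0 if s == 0 else math.sqrt(2.0) * math.cos(math.pi * s * x)


def series_fit(xs, ws, S):
    """Return x -> sum_{s<=S} c_s phi_s(x) with c_s = mean of w_l * phi_s(X_l)."""
    coef = [sum(w * phi(s, x) for x, w in zip(xs, ws)) / len(xs)
            for s in range(S + 1)]
    return lambda x: sum(c * phi(s, x) for s, c in enumerate(coef))


def plug_in_residuals(X, Y, homoscedastic=True, cutoff_exponent=1.0 / 3.0):
    """Split-sample residuals; returns (residuals, index of first residual)."""
    n = len(X)
    b_n = 4.0 + math.log(math.log(n + 20.0))
    n1 = math.floor(n / b_n) + 1
    S = math.floor(n ** cutoff_exponent) + 1        # assumed cutoff exponent
    # Design density from the first n1 points, truncated from below (assumed).
    p_raw = series_fit(X[:n1], [1.0] * n1, S)
    p_hat = lambda x: max(p_raw(x), 1.0 / b_n)
    # Regression estimate from the next n1 points, cf. (3.1)-(3.2).
    x2, y2 = X[n1:2 * n1], Y[n1:2 * n1]
    m_hat = series_fit(x2, [y / p_hat(x) for x, y in zip(x2, y2)], S)
    if homoscedastic:
        start = n - 2 * n1                          # index range of (3.5)
        return [y - m_hat(x) for x, y in zip(X[start:], Y[start:])], start
    # Scale estimate from the third block of n1 points, cf. (3.3).
    x3, y3 = X[2 * n1:3 * n1], Y[2 * n1:3 * n1]
    sq = [(y - m_hat(x)) ** 2 / p_hat(x) for x, y in zip(x3, y3)]
    s2 = series_fit(x3, sq, S)
    sigma_hat = lambda x: min(max(math.sqrt(max(s2(x), 0.0)), 1.0 / b_n), b_n)
    start = 3 * n1                                  # last n2 = n - 3 n1 points
    return [(y - m_hat(x)) / sigma_hat(x)
            for x, y in zip(X[start:], Y[start:])], start
```

On simulated data with a smooth regression function, the returned residuals can be compared with the true errors at the same indices; the point of the section is that the two should be close enough for density estimation.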



3.5. Main assertion. It is possible to show that, under the given assumptions, the MISE of the plug-in Pinsker oracle $f_P(u,\hat Z)$, defined in Appendix B, can asymptotically match the MISE of the Pinsker oracle $f_P(u,Z)$ based on underlying regression errors.

Theorem 1. The cases of finite and infinite supports are considered simultaneously. Suppose that Assumptions A, B and C hold. Then, for all sufficiently large samples such that $\min(n_1,n_2)>4$, the MISE of the plug-in Pinsker oracle satisfies the Pinsker oracle inequality

(3.6) $$E\int (f_P(u,\hat Z)-f(u))^2\,du \le (1+P^*\ln^{-1}(b_n))\,E\int (f_P(u,Z)-f(u))^2\,du + P^*b_n^3n^{-1},$$

where $P^*$ is a finite constant.

Recall that $b_n = 4+\ln\ln(n+20)$ and, thus, $P^*b_n^3n^{-1} = o(1)\ln(n)n^{-1}$, that is, the second term in (3.6) is negligible with respect to minimax MISEs of analytic and differentiable densities, which are at least of order $\ln(n)n^{-1}$. Also note that Assumptions A and B involve no interplay between smoothness of the error density and smoothness of the triplet of nuisance functions (design density, regression and scale). This allows us to conclude that residuals can be considered as proxies for unobserved regression errors, and this conclusion supports the customary methodology of applied statistics.

The obtained result also allows us to establish minimax rates and constants of MISE convergence; they are presented below.

3.6. Optimal rates and constants of MISE convergence. This section answers several classical questions about optimal estimation of a nonparametric error density. To the best of the author's knowledge, so far no results about optimal rates have been known even for the simplest homoscedastic regression model (1.1) with uniformly distributed predictors.

Here we are considering a Sobolev ($\alpha$-fold differentiable) class $S(\alpha,Q)$ of error densities and an analytic class $A(\gamma,Q)$ of error densities. These classes are defined and discussed in Appendix B for finite and infinite supports, and let us note that the same notation is used in both cases.

Corollary 1 (Differentiable error density). Suppose that the assumptions of Theorem 1 and (B.12) of Appendix B hold and $\alpha\ge 2$. Then the plug-in Pinsker oracle is sharp minimax over Sobolev error densities and all possible oracles, that is,

(3.7) $$\sup_{f\in S(\alpha,Q)} E\int [r_n(S(\alpha,Q))(f_P(u,\hat Z)-f(u))]^2\,du = (1+o(1))\inf_{\check f}\sup_{f\in S(\alpha,Q)} E\int [r_n(S(\alpha,Q))(\check f(u,Z)-f(u))]^2\,du = (1+o(1)),$$

where the infimum is taken over all possible oracles $\check f$ based on unavailable-to-the-statistician errors $Z$ and parameters $\alpha$ and $Q$, the sharp normalizing factor is

(3.8) $$r_n(S(\alpha,Q)) := [n^{2\alpha/(2\alpha+1)}/P(\alpha,Q)]^{1/2}$$

and $P(\alpha,Q)$ is the famous constant of Pinsker [35],

(3.9) $$P(\alpha,Q) := (2\alpha+1)[\pi(2\alpha+1)(\alpha+1)\alpha^{-1}]^{-2\alpha/(2\alpha+1)}Q^{1/(2\alpha+1)}.$$

Corollary 2 (Analytic error density). Suppose that the assumptions of Theorem 1 and (B.12) of Appendix B hold. Then the plug-in Pinsker oracle is sharp minimax over analytic error densities and all possible oracles, that is,

(3.10) $$\sup_{f\in A(\gamma,Q)} E\int [r_n(A(\gamma,Q))(f_P(u,\hat Z)-f(u))]^2\,du = (1+o(1))\inf_{\check f}\sup_{f\in A(\gamma,Q)} E\int [r_n(A(\gamma,Q))(\check f(u,Z)-f(u))]^2\,du = (1+o(1)),$$

where the infimum is taken over all oracles $\check f$ based on unavailable-to-the-statistician errors $Z$ and parameters $\gamma$ and $Q$, and the sharp normalizing factor is

(3.11) $$r_n(A(\gamma,Q)) := [(2\pi\gamma)n/\ln(n)]^{1/2}.$$

The results establish that, whenever Assumptions A and B hold, minimax rates and constants of MISE convergence for the error density estimation are the same as for the case of directly observed errors. Moreover, the minimax estimator is a plug-in one based on appropriately calculated residuals, and it satisfies the oracle inequality. These results verify the long-standing Pinsker conjecture.
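The normalizing factors (3.8) and (3.11) translate into benchmark MISE values $r_n^{-2}$, which can be evaluated numerically. The sketch below uses hypothetical function names and simply codes the displayed formulas; the parameter values in the usage lines are arbitrary illustrations.

```python
import math


def pinsker_constant(alpha: float, Q: float) -> float:
    """P(alpha, Q) coded as displayed in (3.9)."""
    e = 2.0 * alpha / (2.0 * alpha + 1.0)
    base = math.pi * (2.0 * alpha + 1.0) * (alpha + 1.0) / alpha
    return (2.0 * alpha + 1.0) * base ** (-e) * Q ** (1.0 / (2.0 * alpha + 1.0))


def sobolev_benchmark_mise(n: int, alpha: float, Q: float) -> float:
    """1 / r_n(S(alpha, Q))^2 = P(alpha, Q) * n^{-2 alpha / (2 alpha + 1)}, cf. (3.8)."""
    return pinsker_constant(alpha, Q) * n ** (-2.0 * alpha / (2.0 * alpha + 1.0))


def analytic_benchmark_mise(n: int, gamma: float) -> float:
    """1 / r_n(A(gamma, Q))^2 = ln(n) / (2 pi gamma n), cf. (3.11)."""
    return math.log(n) / (2.0 * math.pi * gamma * n)


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**6):
        print(n, sobolev_benchmark_mise(n, alpha=2.0, Q=1.0),
              analytic_benchmark_mise(n, gamma=1.0))
```

Note that the analytic benchmark does not depend on $Q$, and that it is of the almost parametric order $\ln(n)n^{-1}$, while the Sobolev benchmark decays at the slower rate $n^{-2\alpha/(2\alpha+1)}$.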

4. Discussion. 

4.1. It is an important fact that Assumption A (about properties of the 
regression function, scale function and design density) and Assumption B 
(about properties of an estimated error density) do not interplay. Also, the 
minimal restrictions on smoothness of all these functions are classical in the 
nonparametric literature; see [21, 22, 28, 37]. 

4.2. The assumption $\int_{-\infty}^{\infty}v^4|h(v)|^2\,dv<\infty$ about the characteristic function in Assumption B (infinite support) is identical to the assumption that the second generalized derivative of $f(u)$ is square integrable; see [33], page 35. Thus, the assumptions for error densities with finite and infinite supports are similar.

4.3. Let us heuristically explore the presented results from the point of view of the prediction of a new observation $Y^*$ at a random level $X$ of the predictor; see [32], Section 2.5. Whatever prediction topic is considered (hypothesis testing, confidence intervals, etc.), the error density plays a crucial role. Consider the classical model (1.1), and recall that a traditional applied approach/paradigm is to assume that $Y^* = \hat m(X)+\eta$, where $\hat m$ is a regression estimate, $\eta$ is a prediction error with a density $\hat f$, and the regression and error density estimates are based on the previous $n$ observations. The prediction problem resembles the one considered in the article, so it is natural to explore how the regression and error density estimates suggested in Section 3 will work together in the prediction problem. We note that, according to (1.1), the prediction error can be written as $\eta = m(X)-\hat m(X)+\xi$; thus, to verify the paradigm "$\xi$ mimics $\eta$," we need to understand how the difference $m(X)-\hat m(X)$ affects the density of $\eta$. Recall that this difference has a classical decomposition into a zero-mean random component and bias. To simplify the heuristic, let us consider only the effect of bias; denote the bias as $b(X)$. Under Assumption A, the squared bias can be (at most) of order $n^{-2/3}$. This implies that the characteristic function $h_b(v) := E\{e^{ivb(X)}\}$ of the bias is close to 1 for frequencies $|v|<o(1)n^{1/3}$, and we can conclude that on these frequencies the characteristic function of $\xi$ does mimic the characteristic function of $\eta$. [Note that beyond these frequencies the characteristic function $h_b(v)$ may be separated from 1.] Recall that at least twice-differentiable densities, considered in this article, are estimated only on frequencies $|v|<O(1)n^{1/5}$. As a result, the paradigm holds (of course, we have considered only the bias effect, but similar arguments are applied to the random component). On the other hand, let us relax Assumption B and assume that the error density is only differentiable. Then a rate-optimal error density estimation requires evaluation of its characteristic function on frequencies $|v|<O(1)n^{1/3}$, and then the distributions of $\eta$ and $\xi$ may be different. Of course, in this case we can employ a nonoptimal error density estimation, which involves evaluation of the characteristic function only on frequencies $|v|<o(1)n^{1/3}$. The latter preserves the paradigm at the expense of the error density estimation. What we have observed is the onset of irregularity in the error density estimation, and this is an interesting and challenging topic on its own.
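The claim that the characteristic function of the bias stays close to 1 on low frequencies can be made explicit with a short worked bound. It is a sketch that reads "squared bias of order $n^{-2/3}$" as $E\{b^2(X)\}\le Cn^{-2/3}$ (this reading is an assumption), and it uses only $|e^{it}-1|\le|t|$ and the Cauchy-Schwarz inequality:

```latex
\[
|h_b(v)-1| \;=\; \bigl|E\{e^{ivb(X)}-1\}\bigr|
\;\le\; E\bigl|e^{ivb(X)}-1\bigr|
\;\le\; |v|\,E|b(X)|
\;\le\; |v|\,\bigl(E\{b^2(X)\}\bigr)^{1/2}
\;\le\; C^{1/2}\,|v|\,n^{-1/3}.
\]
```

Hence $h_b(v)\to 1$ uniformly over $|v|\le \epsilon_n n^{1/3}$ whenever $\epsilon_n\to 0$, which is the frequency range used in the heuristic.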

4.4. There will be a separate paper about the case of infinite support 
and heteroscedastic regression. Due to the presence of multiplicative mea- 
surement errors in residuals, this case requires an additional assumption on 
the tails of the error distribution, and it is a technically involved matter to 
suggest a mild assumption. 

4.5. The split-data approach, used for estimation of the nuisance func- 
tions and the error density, can be replaced by using all n observations for 
estimation of all functions involved. The corresponding proof becomes much 
more complicated, and the interested reader is referred to [16]. 

4.6. All assertions hold if, in truncated cosine series estimates of the design density, regression and scale, defined in Section 3.4, the cutoff $S$ is changed to $n^{1/3}\ln(b_n)$. Then, under Assumption A, all these estimates are undersmoothed; that is, they have a bias which is smaller than an optimal one. This is an interesting remark for the reader who would like to understand the variance-bias balance in these estimates. Also, Efromovich [16] shows that rate-optimal adaptive estimation of nuisance functions can also be used. Thus, there is a robust choice among Fourier series estimates of the nuisance functions. On the other hand, it is an open problem to explore nonseries estimates like kernel or spline ones. Some numerical results can be found in [16].

4.7. For density estimation based on direct observations, there is a vast literature on closely related topics like censored observations, biased data, observations contaminated by measurement errors, estimation of functionals, ill-posed settings, estimation under a shape restriction, and so on. The obtained results indicate that it is reasonable to conjecture that many of the known direct-data results can be extended to the error density estimation as well. For instance, Van Keilegom and Veraverbeke [38] considered the problem of consistent error density estimation in a censored regression; using [4, 15, 18], it is reasonable to conjecture that optimal nonparametric results can be obtained for censored and biased regression models.

4.8. The reason for considering the case of dependent errors and predictors is threefold. First, this is a rather typical case in applications; the obtained result shows that in this case the marginal error density is exhibited by residuals. Second, we can conclude that the plug-in EP estimation is robust. Finally, let us stress that small datasets may not allow the statistician to evaluate a conditional density; then the univariate marginal density becomes a valuable tool for data analysis. Let us finish this remark by answering a question that the author was frequently asked during presentation of the result. Is it possible that the marginal error density is normal and the conditional density (of regression error given predictor) is not? The answer is "yes." As an example, define a bivariate density $\psi(u,x) := f(u)+\delta\lambda(u)\mu(x)$, where $f(u)$ is the standard normal density, $|\lambda(u)\mu(x)|\le 1$, $\int_0^1\mu(x)\,dx=\int_{-\infty}^{\infty}\lambda(u)\,du=0$, and $\lambda(u)=0$ whenever $f(u)<\delta$. There are plenty of such functions and, under the given assumptions, $\psi(u,x)$ is a valid bivariate density on $(-\infty,\infty)\times[0,1]$ with the standard normal marginal density $f(u)$. Obviously, the conditional density $\psi(u|x) := \psi(u,x)$ is not necessarily normal, and this verifies the assertion. The conclusion is that, even if the marginal distribution of residuals may be considered normal, unfortunately this does not imply the normality of the conditional distribution.
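One concrete instance of this construction can be checked numerically. The particular choices below ($\delta=0.05$, $\lambda(u)=\sin(\pi u/1.5)$ on $|u|\le 1.5$, $\mu(x)=\cos(\pi x)$) are illustrative assumptions, not taken from the article; they satisfy the stated conditions because the standard normal density exceeds 0.05 on $|u|\le 1.5$.

```python
import math

DELTA = 0.05   # illustrative value of delta
C = 1.5        # lambda vanishes for |u| > C; requires normal_pdf(C) >= DELTA


def normal_pdf(u: float) -> float:
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def lam(u: float) -> float:
    """Odd function supported on [-C, C], so it integrates to zero."""
    return math.sin(math.pi * u / C) if abs(u) <= C else 0.0


def mu(x: float) -> float:
    """Integrates to zero over [0, 1]."""
    return math.cos(math.pi * x)


def psi(u: float, x: float) -> float:
    """Bivariate density f(u) + delta * lambda(u) * mu(x) on R x [0, 1]."""
    return normal_pdf(u) + DELTA * lam(u) * mu(x)


def marginal_in_u(u: float, m: int = 1000) -> float:
    """Midpoint-rule integral of psi(u, x) over x in [0, 1]."""
    return sum(psi(u, (i + 0.5) / m) for i in range(m)) / m


if __name__ == "__main__":
    u = 0.75
    print("marginal density:", marginal_in_u(u), "normal density:", normal_pdf(u))
    print("conditional density at x=0:", psi(u, 0.0))
```

The marginal in $u$ coincides with the standard normal density, while the conditional density at $x=0$ differs from it by $\delta$ at $u=0.75$.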

4.9. Brown, Low and Zhao [5] introduced the notion of nonparametric superefficiency, and they noticed that the Pinsker oracle (EP estimate) was superefficient; see also [18]. This fact, together with Theorem 1, immediately implies that the plug-in Pinsker oracle is also superefficient.

4.10. Let us note that plug-in estimation obviously enjoys its renaissance in nonparametric estimation theory; see the discussion in [3] and [23]. A typical nonparametric plug-in setting is about optimal estimation of a functional. In this article a plug-in approach is caused by the indirect nature of observations, and, thus, it presents a new chapter in the theory of plug-in estimation.

4.11. It is a very interesting and technically involved problem to estimate the error density for the model with measurement errors in the predictors; see [7].

4.12. The results hold for the case of a fixed-design regression; see [16].

4.13. Let us comment on our main assumption about independence of pairs of observations $(X_1,Y_1),\dots,(X_n,Y_n)$ with the typical counterexample being the case of dependent errors. The author conjectures that, based on the result of Efromovich [14], Section 4.8, even errors with a long memory should not affect the corresponding optimal rates. On the other hand, the outcome should change dramatically if a fixed design regression (say a time series) is considered. For this setting, the result of Hall and Hart [26] may be instrumental.

4.14. It is an open and practically interesting topic to develop optimal 
Bayes and conditional distribution methods and then compare them with 
the developed plug-in estimator for the case of small datasets. 

4.15. Wavelet regression is a popular tool for solving many practical 
problems involving spatially inhomogeneous regressions. It is an open and 
interesting problem to explore the possibility of using wavelet-residuals as 
proxies for underlying regression errors. 

APPENDIX A: PROOFS 

Proof of Theorem 1. Only the main steps of the proof are presented; the interested reader can find a detailed proof in [17, 19]. We begin with a more complicated case of finite support. Recall that the Pinsker oracle $f_P$ is defined in Appendix B and it is based on pseudo-statistics $\{\bar\mu_k,\bar\theta_j\}$; in what follows we use the diacritics "bar" or "hat" above $\mu_k$ and $\theta_j$ to indicate a pseudo-statistic (oracle) based on underlying errors or a statistic based on observations, respectively. Set $Z^* := (\varepsilon_{n-n_2+1},\dots,\varepsilon_n)$. Then a straightforward calculation, based on $n_2\ge[1-3(b_n^{-1}+n^{-1})]n$, establishes a plain inequality $E\int (f_P(u,Z^*)-f(u))^2\,du \le (1+Cb_n^{-1})E\int (f_P(u,Z)-f(u))^2\,du$. As a result, in what follows we are assuming that pseudo-statistics $\bar\mu_k$ and $\bar\theta_j$ are based on $Z^*$ in place of $Z$, that is, plugged-in residuals correspond to errors used by the Pinsker oracle. Also recall that the oracle uses EP blockwise-shrinkage with $L_k = k^2$ and $t_k = \ln^{-2}(2+k)$. Keeping this in mind and using the Parseval identity, we write

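(A.1)
$$\begin{aligned}
E\int_0^1 (f_P(u,\hat Z)-f(u))^2\,du
&= E\Biggl\{\sum_{k=1}^{K}\sum_{j\in B_k}(\hat\mu_k\hat\theta_j-\theta_j)^2\Biggr\}+\sum_{k>K}\sum_{j\in B_k}\theta_j^2\\
&= E\Biggl\{\sum_{k=1}^{K}\sum_{j\in B_k}\bigl[(\bar\mu_k\bar\theta_j-\theta_j)+(\hat\mu_k\hat\theta_j-\bar\mu_k\bar\theta_j)\bigr]^2\Biggr\}+\sum_{k>K}\sum_{j\in B_k}\theta_j^2\\
&\le (1+\ln^{-1}(b_n))\,E\Biggl\{\sum_{k=1}^{K}\sum_{j\in B_k}(\bar\mu_k\bar\theta_j-\theta_j)^2+\sum_{k>K}\sum_{j\in B_k}\theta_j^2\Biggr\}\\
&\quad+2(1+\ln(b_n))\Biggl[\sum_{k=1}^{K}\sum_{j\in B_k}E\hat\mu_k^2(\hat\theta_j-\bar\theta_j)^2+\sum_{k=1}^{K}\sum_{j\in B_k}E(\hat\mu_k-\bar\mu_k)^2\bar\theta_j^2\Biggr]\\
&= (1+\ln^{-1}(b_n))\,E\int_0^1 (f_P(u,Z^*)-f(u))^2\,du\\
&\quad+2(1+\ln(b_n))\Biggl[\sum_{k=1}^{K}\sum_{j\in B_k}E\hat\mu_k^2(\hat\theta_j-\bar\theta_j)^2+\sum_{k=1}^{K}\sum_{j\in B_k}E(\hat\mu_k-\bar\mu_k)^2\bar\theta_j^2\Biggr].
\end{aligned}$$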
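The two factors $1+\ln^{-1}(b_n)$ and $2(1+\ln(b_n))$ in (A.1) come from two elementary inequalities. The following worked step is a sketch of that bookkeeping, with $a$, $c$ and $\gamma$ introduced here only for illustration:

```latex
\[
(a+c)^2 \le (1+\gamma)a^2+(1+\gamma^{-1})c^2,\qquad \gamma>0,
\]
which follows from $2ac\le\gamma a^2+\gamma^{-1}c^2$. Take
$\gamma=\ln^{-1}(b_n)$, $a=\bar\mu_k\bar\theta_j-\theta_j$ and
$c=\hat\mu_k\hat\theta_j-\bar\mu_k\bar\theta_j$. Next write
\[
c=\hat\mu_k(\hat\theta_j-\bar\theta_j)+(\hat\mu_k-\bar\mu_k)\bar\theta_j,
\qquad
c^2\le 2\hat\mu_k^2(\hat\theta_j-\bar\theta_j)^2+2(\hat\mu_k-\bar\mu_k)^2\bar\theta_j^2 .
\]
Combining the two displays gives the factor $2(1+\ln(b_n))$ in front of the
two sums that remain to be evaluated.
```

The choice $\gamma=\ln^{-1}(b_n)$ keeps the leading factor close to 1 at the price of a slowly growing factor in front of the remainder.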



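We need to evaluate the second term on the right-hand side of (A.1). Recall that $\xi = b\varepsilon + a$ and write

(A.2)
$$\begin{aligned}
\hat\theta_j-\bar\theta_j
&= n_2^{-1}\sum_{l=3n_1+1}^{n}\bigl[\varphi_j([Y_l-\hat m(X_l)]/b\hat\sigma(X_l)-a/b)-\varphi_j(\varepsilon_l)\bigr]\\
&= n_2^{-1}\sum_{l=3n_1+1}^{n}\bigl[\varphi_j([m(X_l)+\sigma(X_l)\xi_l-\hat m(X_l)]/b\hat\sigma(X_l)-a/b)-\varphi_j(\varepsilon_l)\bigr]\\
&= n_2^{-1}\sum_{l=3n_1+1}^{n}\Biggl[\varphi_j\Biggl(\varepsilon_l+\frac{m(X_l)-\hat m(X_l)}{b\hat\sigma(X_l)}+\frac{\sigma(X_l)-\hat\sigma(X_l)}{b\hat\sigma(X_l)}\,\xi_l\Biggr)-\varphi_j(\varepsilon_l)\Biggr].
\end{aligned}$$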

Using the Taylor expansion for the cosine function, we can write 



^3? = \^2^ E [-^jHi2^'^sH^jei) 



/=3ni+l 
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- {l/2){^ jfHf 2^1^ cos{^ jei) 
+ {l/6){7TjfHf2^/^sm{Trjei) 
+ (l/24)(7rjf //f2V2cos(7rjQ) 



<C 



•2 -2 
J 712 < 



Hism{TTjei) 

. i=3ni+l 



(A.3) 



^ Hfcos{7r jei) 

i=3ni+l 



^ Hf sm{7r jei) 

l=3ni+l 



+ iS^S E cosi-K jei) 

U=3ni+1 



U=3ni+1 J 

In the first equality we denoted by $\nu_l$'s generic random variables satisfying $|\nu_l|\le 1$, and

(A.4) $$H_l := \frac{m(X_l)-\hat m(X_l)}{b\hat\sigma(X_l)}+\frac{\sigma(X_l)-\hat\sigma(X_l)}{b\hat\sigma(X_l)}\,\xi_l.$$

As we see, the analysis of $(\hat\theta_j-\bar\theta_j)^2$ is converted into the analysis of the nonparametric regression and scale estimates. Evaluations are lengthy and technically involved (see them in [17]), and they imply



K 



(A.5) 



E E{e,-e,f<chnn-\ 

k=ljeBk 



Note that $\hat\mu_k\le 1$, so we have evaluated the first sum on the right-hand side of (A.1). Now let us consider the second sum. Write



(A.6) 
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+ 



ft2 



9/0 + 712 

--■.Di{k)+D2{k) + D3{k). 



Here we have used the notation @k '■= J2j£Bk(^j ~ "^2 ) ®fe '■— 

Let us consider, in turn, these three terms, beginning with Di{k). Skipping 
the indicator functions, we are going to evaluate 



Dl{k):=Lk 



9^ + 772 ©fc + n2 
Lkn^^Qk-Qk? 



(9fc + 772-^)(9,. + 772-^)2' 

Using the Cauchy inequahty, we can write, for any Cfc > 1, 



{Qk-ek? = Lf 



(A.7) 



2cuY.i^3-0,? + ^k' E ^. 

2 ' 



E(^.-^.)' 



E^l 



Note that Y^jeB^ ^ = Lk{@k + n2^), to get 

(9fc+772 )(9fc + 772 )2 

= :Dl,{k) + DUk). 



2 "-fc 



9fc + 772 



Set 4 = Lkk'^+'^, < d < 1, and denote Di2{k) := Dl2{k)I{Qk > tk 
I{Qk>tkn2^). We get 



77o X 



(A.8) f;D,,W<2„-£t-'-!£^i^i^^ 

9fc + 772 



<C77~^ 



k=l 



k=l 




It is a more complicated task to evaluate Dli{k). Denote 

and write 



(A.9) 



EDii{k)<Cn2k^+'^E 



The squared difference (dj — Oj)'^ was considered in (A. 3), and a calculation 
yields 



(A.IO) E 



Using this inequality in (A.9) implies Z^^i -^-^ii(^) I^Cn ^6^. In its 
turn, this, together with (A. 8), yields 

K K 
E Y Di{k) < J2 E{Dii{k) + Di2{k)} < Cn-^bl 

k=l k=l 

Now we consider the second term D2{k) in (A. 6). Write 



D2{k) 



e 



(0fc + 



-Y:^[I{tkn2 < @k < ^tkn^') 

+ I{&k > 2tfcn2 ')/(Gfc - Bfc > efc/2)]. 



Recall that cf. = Lkk^^'^, 0<d<l. Then using (A. 7), we get 



(A.ll) iek-Qky<CL^'ci 



1 2 



l\2 



This, together with Chebyshev's inequality and (A.IO), yields 
ED2{k) < Cn^^LktlE{I{tkn^^ < Gfc < 2tkn^^)} 

+ Cn^^LkE{I{@k - Bfc > Gfc/2)/(efc > 2tfcn2-^)} 
(A.12) < Cn^^LktlE{I{tkn^^ < < 2*^7^2 ')} 

+ Ck^+Hfin^^n"'^/^^ lni°(n) + n^^bnk"^) 

^2 '^^fc' 



+ CnyH7^k-^- 
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Let us evaluate the term E{I{tkn2^ < ©fe < '^tkn2^)}. Denote @k '■= 
L~^^ X J2j£Bk(^j ^i^d 9j := f{u)ipj{u) du. Then using Lemma 1 in [11], 
together with some algebra, implies 

Eilitku^' < Qk < 2ifen2 ')} 

</((l/2)tfcn2^<efc<4tfcn^^) 

+ E{I{Gk - Bfc > (l/2)tfcn2-i)}/(efc < (l/2)tfcn2 
+ E{I{@k -@k> {l/2)@k)}I{@k > 4.tkn2^) 

B + 712 

Let us recall a familiar blockwise Wiener oracle, which knows regression errors and an estimated density of errors and employs optimal shrinkage coefficients $\mu_k = \Theta_k/(\Theta_k+n_2^{-1})$. The Wiener oracle is the benchmark for the Pinsker oracle, and its MISE is proportional to $n_2^{-1}\sum_{k=1}^{K}L_k\Theta_k/(\Theta_k+n_2^{-1})$; see the discussion in [11, 14] and Appendix B. Then, combining the results, we get

^ ED2{k) < Cn^' ^kLk- ^/((l/2)tfcn2-^ < 6,. < UkU^^) + Cbln^^ 

k=l k=l + ^2 

<Cn2X + C tkLk—^^I{{l/2)tkn2^Gk<4tkn2^) 
. ^,2/3 ®k + "2 

k>b„' 

+ Cbln-^ 

< C\n-\bn)E l\fp{u, Z*) - f{u)f du + Chln~\ 
Jo 

Now we consider the third term D2,{k) in (A. 6). Write 
Dz{k) < LkQkI{Qk>2tkn2^)I{Qk<tkn2^) 

+ 2tkLkQkI{tkn2^ < Qk < 2ifcn^i)/(Gfc < t^ns ^) 
=:D3i{k)+Ds2{k). 
To evaluate D3i(/c), we note that 

D3i{k) < 2Lk\ek - OklliOk - Ofc > 4n^')/(e,. < tfcn^i). 
This relation, the Chebyshev inequality, (A. 10) and (A. 11) imply 

ED3iik)<CLknf[Lfclin-^/^^ln'^{n) + k-^)+cf]/itkn2') 
< Cnf 2n2t^i[A;i+°'(n-3/5 IniO(n) + /fc-^) + ifc-^"^] 
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and, thus, 



K 
k=l 

To evaluate D^2{k), we write 
D^2{k) = 2tkLkekI{tkn^^ <Qk<2tkn^^)I{ek<{l/2)tkn^') 

+ 2tkLk@kI{tkn2^ < Gfc < 2tfcn^^)/((l/2)tfcn2 ^ < ©fc < ifc^^^) 

= :D32iik)+D322ik). 

The first term is evaluated in the same way as the term D31 was evaluated, 
and we get J2k=iED32i{k) < Cbln-^. The second term can be estimated as 
follows. First, we note that ©^/(tfcn^^ < ©fe < 2tkn2^) < 2tkn2^ ■ Then we 
realize that the term was evaluated earlier; see the first term in (A. 12). This 
implies j:LiED-s22{k) < Cln-\bn)E J^{fp{u,Z*) - f{u)fdu + Cb^ 



n 



Combining the results, we get 

K 



J2 EDsik) < C\n"\bn)E f\fp{u, Z*) - f{u)f du + Cbln-\ 
k=i 

Then, by plugging the obtained estimates for Di{k), D2{k) and D3{k) 
into (A. 6), we obtain 



(A.13) 



< Cln~\bn)E [\fp{u, Z*) - f{u)f du + Cbln-\ 
Jo 



Using (A. 5) and (A.13) in (A.l) verifies Theorem 1 for the finite support 
case. For the infinite support case, we set Z* := (e„_2ni+i, ■ ■ ■ ,£n), consider 
pseudo-statistics h{u) and flk based on Z*, and then write, similarly to (A.l), 



E 



{fp{u,Z)-f{u)Ydu 

K 



/oo f 
^~^E/^M Re(/i(t;)e-^™)di; 
-00 [ 



K 
k=l 

k>K 



Bk 



Re{h{v)e 



\dv 



n 2 



Re(/i(^;)e-™")d^; 



du 
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(A.14) 



K 



<{l+ln-\bn))E 
+ 2(l + ln(6„)) 



fikh{v) 



h{v)\'^ dv + TT-^ I \h{v)\^dv 

1,^ i^-' Bi. 



k>K^ 



oo 

-oo 
K 



{fp{u,Z*)-f{u)fdu 



k=l 



B, 



K 



\h{v) - hiv)]"^ dv 

\h{v)\'^dv 



k=l 



Bk 



If we compare the right-hand side of (A.14) with the right-hand side of (A.1) and recall the steps taken after (A.1), then it is easy to recognize that the difference is in analyzing $\hat h(v)-\bar h(v)$ in place of $\hat\theta_j-\bar\theta_j$. Another remark is that now the second term in (A.4) vanishes because the unit scale function is known. These remarks allow us to follow the above-outlined proof and verify (3.6); details can be found in [19]. Theorem 1 is verified. □



Proof of Corollaries 1 and 2. First, it is checked that the constant $P^*$ in (3.6) is uniformly bounded over the considered function classes. Second, it is easy to check that (3.6) holds with the estimate and the oracle exchanging places. Then using Theorem 1, Corollaries B1 and B2 of Appendix B, together with some algebra, verifies these corollaries. Details can be found in the technical reports. □



APPENDIX B: THE PINSKER ORACLE 

The Pinsker oracle is a data-driven density estimator possessing some desired statistical properties for the case of directly observed regression errors; in other words, it is a traditional density estimator whose oracle feature is in the knowledge of regression errors that are obviously unavailable to the statistician. For the case of direct observations and finite support, a good candidate for an estimator is the Efromovich-Pinsker (EP) data-driven (adaptive) procedure, which possesses an impressive bouquet of asymptotic properties of being: (a) minimax over a vast set of function classes which includes parametric, differentiable and analytic ones; (b) superefficient; (c) an excellent plug-in estimate; (d) applicable to filtering, regression and spectral density settings due to equivalence results. The interested reader can find discussion in [5, 9, 11, 14, 34, 39]. On the other hand, no results are available about a similar estimator for the case of a density with infinite support. The primary aim of this appendix is to develop such an estimator and explore its properties, and the secondary aim is to remind the reader of known results for the case of finite support.

We begin with the primary aim. Consider a density $f(z)$, $-\infty<z<\infty$, such that $\int_{-\infty}^{\infty}f^2(z)\,dz<\infty$. The problem is to estimate $f(z)$ under the MISE criterion when $n$ i.i.d. realizations $Z_1,\dots,Z_n$ from $f$ are given. The underlying idea of EP estimation, translated from a finite support setting into the infinite one, is as follows. First, the characteristic function $h(v):=\int_{-\infty}^{\infty}e^{ivz}f(z)\,dz$ is estimated by its empirical counterpart $\hat h(v):=n^{-1}\sum_{l=1}^{n}e^{ivZ_l}$. Second, the estimate is "smoothed" by a statistic (filter) $\hat\mu(v)$, which is the main "ingredient" of the EP method defined shortly. Finally, a smoothed empirical characteristic function $\hat\mu(v)\hat h(v)$ is inverted to obtain a density estimate $\hat f(z):=(2\pi)^{-1}\int_{-\infty}^{\infty}\hat\mu(v)\hat h(v)e^{-ivz}\,dv$. Now we are in position to explain the underlying idea of choosing the EP smoothing. Consider a real even function $\mu(v)\colon(-\infty,\infty)\to[0,1]$, set $\tilde f(z):=(2\pi)^{-1}\int_{-\infty}^{\infty}\mu(v)\hat h(v)e^{-ivz}\,dv$, and evaluate the MISE of this estimate using the Plancherel identity,

(B.1)
$$\begin{aligned}
E\int_{-\infty}^{\infty}(\tilde f(z)-f(z))^2\,dz
&=(2\pi)^{-1}E\int_{-\infty}^{\infty}|\mu(v)\hat h(v)-h(v)|^2\,dv\\
&=(2\pi)^{-1}E\int_{-\infty}^{\infty}|\mu(v)(\hat h(v)-h(v))-(1-\mu(v))h(v)|^2\,dv.
\end{aligned}$$

Recall two familiar properties of the empirical characteristic function:

(B.2) $$E\hat h(v)=h(v),\qquad E|\hat h(v)-h(v)|^2=n^{-1}(1-|h(v)|^2).$$

This, together with simple algebra, shows that a smoothing function (oracle),

(B.3) $$\mu^*(v):=\frac{|h(v)|^2}{|h(v)|^2+n^{-1}(1-|h(v)|^2)},$$

minimizes (B.1). The reader might notice that this smoothing function is the analog of the famous Wiener filter, and this is the reason why it also can be referred to as a filter; see [29], Chapter 10. The function $\mu^*(v)$ is unknown to the statistician, but, using (B.2), it can be estimated by the statistic

(B.4) $$\tilde\mu(v):=\frac{|\hat h(v)|^2-n^{-1}}{|\hat h(v)|^2}\,I(|\hat h(v)|^2>(1+t)n^{-1}),\qquad t>0;$$

here $I(\cdot)$ is the indicator function and $t$ is a threshold level ($1+t$ is often called a penalty). Hard thresholding (which is a trademark of the EP smoothing) is used to make the statistic a bona fide smoothing function. Unfortunately, it is not difficult to verify that this naive mimicry is not sufficiently accurate. Thus, by recalling that any characteristic function $h(v)$ is continuous and, thus, $\mu^*(v)$ is continuous, it is natural to approximate $\mu^*(v)$ by a piecewise constant function and then estimate that function. This is the underlying idea of the EP blockwise procedure. Note that $\mu^*(v)$ is an even function, and this allows us to work only with $v\in[0,\infty)$. We divide the half-line $[0,\infty)$ into a sequence of nonoverlapping blocks (intervals) $B_1,B_2,\dots$ with corresponding lengths $L_k:=\int_{B_k}dv>0$, and then consider a smoothed empirical characteristic function $\tilde h(v)=\sum_{k\ge 1}\mu_k\hat h(v)I(v\in B_k)$. Similarly to (B.1)-(B.3), we can establish that the MISE of the corresponding density estimate is minimized by the oracle

(B.5) $$\mu_k^*:=\frac{L_k^{-1}\int_{B_k}|h(v)|^2\,dv}{L_k^{-1}\int_{B_k}|h(v)|^2\,dv+n^{-1}\bigl(1-L_k^{-1}\int_{B_k}|h(v)|^2\,dv\bigr)}.$$

Note the striking similarity between (B.3) and (B.5). Similarly to (B.4), the proposed estimate of the optimal $\mu_k^*$ is

(B.6) $$\hat\mu_k:=\frac{L_k^{-1}\int_{B_k}|\hat h(v)|^2\,dv-n^{-1}}{L_k^{-1}\int_{B_k}|\hat h(v)|^2\,dv}\,I\Bigl[L_k^{-1}\int_{B_k}|\hat h(v)|^2\,dv>(1+t_k)n^{-1}\Bigr].$$

Then the EP density estimate is defined as

(B.7) $$\hat f(z):=\pi^{-1}\sum_{k=1}^{K}\hat\mu_k\int_{B_k}\mathrm{Re}(\hat h(v)e^{-ivz})\,dv,$$

where the cutoff $K$ is a minimal integer such that $\sum_{k=1}^{K}L_k\ge n^{1/5}b_n$; this cutoff corresponds to the considered class of at least twice differentiable densities. The estimator (B.7) will be called the EP estimator for the case of infinite support. To better appreciate it, let us recall the EP density estimator for the finite support $[0,1]$. The main difference is that here a discrete Fourier transform is used:

(B.8) $$\hat f(z):=1+\sum_{k=1}^{K}\hat\mu_k\sum_{j\in B_k}\hat\theta_j\varphi_j(z),$$

where $\{1,\ \varphi_j(z)=2^{1/2}\cos(\pi jz),\ j=1,2,\dots\}$ is the classical cosine basis on $[0,1]$, $\{\hat\theta_j\}$ are empirical Fourier coefficients [estimates of Fourier coefficients $\theta_j:=\int_0^1 f(z)\varphi_j(z)\,dz$],

(B.9) $$\hat\theta_j:=n^{-1}\sum_{l=1}^{n}\varphi_j(Z_l),$$

and the smoothing weights (coefficients, filter) are

(B.10) $$\hat\mu_k:=\frac{L_k^{-1}\sum_{j\in B_k}\hat\theta_j^2-n^{-1}}{L_k^{-1}\sum_{j\in B_k}\hat\theta_j^2}\,I\Bigl(L_k^{-1}\sum_{j\in B_k}\hat\theta_j^2>(1+t_k)n^{-1}\Bigr).$$

Here the set of positive integers is divided into a sequence of blocks (including only neighbors) of cardinality $L_k$. Note that the EP infinite- and finite-support density estimates do look alike.

Finally, if $Z_1,\dots,Z_n$ are unobserved regression errors, then the EP estimate becomes a Pinsker oracle. In this article, for both finite and infinite supports, this oracle is denoted as $f_P(z,Z_1^n)$. Also, let us introduce the notation $\Theta_k:=L_k^{-1}\int_{B_k}|h(v)|^2\,dv$ and $\Theta_k:=L_k^{-1}\sum_{j\in B_k}\theta_j^2$ for infinite and finite supports, respectively, and $\mu_k:=\Theta_k/(\Theta_k+n^{-1})$. Then $f_P^*(z,Z_1^n)$ will denote a super-oracle (Wiener filter) which uses $\mu_k$ in place of the estimated weights $\hat\mu_k$ in the EP estimate; note that the super-oracle knows an estimated density $f$ and this is the oracle that is traditionally considered in the case of direct observations; see [14].
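The blockwise construction for infinite support can be sketched numerically as follows. This is an illustration under stated assumptions rather than the author's implementation: the function name is hypothetical, integrals over blocks are replaced by a midpoint rule on a grid of step about `dv`, the shrinkage weight is taken in the form "(block average minus $n^{-1}$) over block average, kept only above the threshold $(1+t_k)n^{-1}$", and the block lengths and thresholds are supplied by the caller.

```python
import cmath
import math


def ep_density_estimate(Z, block_lengths, thresholds, dv=0.05):
    """Blockwise-shrinkage density estimate on the real line.

    Blocks B_1, B_2, ... tile [0, sum(block_lengths)); the returned function
    evaluates pi^{-1} * sum_k mu_k * int_{B_k} Re(h_hat(v) e^{-ivz}) dv.
    """
    n = len(Z)
    blocks, left = [], 0.0
    for L, t in zip(block_lengths, thresholds):
        m = max(1, round(L / dv))                 # grid points in this block
        step = L / m
        grid = [left + (i + 0.5) * step for i in range(m)]
        # Empirical characteristic function h_hat(v) = n^{-1} sum_l exp(i v Z_l).
        h = [sum(cmath.exp(1j * v * z) for z in Z) / n for v in grid]
        avg = sum(abs(c) ** 2 for c in h) / m     # approximates L_k^{-1} int |h_hat|^2
        mu = (avg - 1.0 / n) / avg if avg > (1.0 + t) / n else 0.0
        blocks.append((mu, step, grid, h))
        left += L

    def f_hat(z: float) -> float:
        total = 0.0
        for mu, step, grid, h in blocks:
            if mu > 0.0:                          # thresholded blocks contribute nothing
                total += mu * step * sum((c * cmath.exp(-1j * v * z)).real
                                         for v, c in zip(grid, h))
        return total / math.pi

    return f_hat
```

For instance, with standard normal data one may try four blocks of lengths $1, 4, 9, 16$ and thresholds $\ln^{-2}(2+k)$ (these particular values are assumptions made for the illustration) and compare the estimate at zero with $(2\pi)^{-1/2}\approx 0.399$. Blocks where the empirical characteristic function is indistinguishable from noise are removed by the hard threshold, which is what makes the procedure adaptive.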

In what follows $C$s denote generic positive constants and it is understood that the oracles vanish beyond the unit interval in the finite support case. Let us formulate a main property of the EP estimate; the result is new for the case of infinite support and it is due to Efromovich [11] for finite support.



Theorem B1. Suppose that $Z_1,\dots,Z_n$ are i.i.d. realizations from a square integrable density $f$ with known support that can be either $[0,1]$ or $(-\infty,\infty)$. Consider the case of bounded thresholds $t_k<C$. Then the MISE of the Pinsker oracle (EP estimate) $f_P(z,Z_1^n)$ satisfies the upper bound (oracle inequality)

E r {fp{z,Z^)-f{z)fdz 



(B.ll) 



< min E 



+ 



+ 



{fUz,z^)-f{z)rdz, 

X) 

k=l k>K J / 



n 



k=l 



K 



Cd)n-^Y.Lk\'' 



k=l 



where the constant $c^*$ is 1 or $\pi^{-1}$, and the functional $d_f$ is $1+\sum_{j=1}^{\infty}|\theta_j|$ or $\int_0^{\infty}|h(v)|\,dv$, for the finite- and infinite-support cases, respectively.



There are many important corollaries of Theorem B1. We present only two that are relevant to the topic of error density estimation: sharp minimax estimation of differentiable and analytic densities. We begin with the case of differentiable densities. For infinite support, we introduce a familiar Sobolev class $S(\alpha,Q):=\{f\colon\int_{-\infty}^{\infty}[f^2(z)+(f^{(\alpha)}(z))^2]\,dz\le Q\}=\{h\colon(2\pi)^{-1}\int_{-\infty}^{\infty}(1+|v|^{2\alpha})|h(v)|^2\,dv\le Q\}$, where $f^{(\alpha)}$, $\alpha\ge 2$, is the $\alpha$th generalized derivative and $0<Q<\infty$; see [33], page 144, [24] and [36]. With some obvious abuse of notation, for the case of finite support we define a similar Sobolev class $S(\alpha,Q):=\{f\colon\sum_{j}[1+(\pi j)^{2\alpha}]\theta_j^2\le Q\}$; see [14], Chapter 2. Here we are interested only in the case $\alpha\ge 2$; more general Sobolev classes are considered in [20].

Corollary B1. Consider the setting of Theorem B1. Suppose that $\alpha\ge 2$ and that blocks $B_k$ and thresholds $t_k$ used by the Pinsker oracle (EP estimate) $f_P(z,Z_1^n)$ satisfy

oo 

(B.12) < Lk+i/Lk = 1, lim t^ = 0. 

fe— >00 K— ►oo 

fe = l 

Then the Pinsker oracle (EP estimate) is sharp minimax over Sobolev densities and

(B.13) $$\sup_{f\in S(\alpha,Q)}E\int_{-\infty}^{\infty}[r_n(S(\alpha,Q))(f_P(z,Z_1^n)-f(z))]^2\,dz=(1+o(1))\inf_{\check f}\sup_{f\in S(\alpha,Q)}E\int_{-\infty}^{\infty}[r_n(S(\alpha,Q))(\check f(z)-f(z))]^2\,dz=(1+o(1)),$$

where the infimum is taken over all possible estimates $\check f$ based on observations $Z_1^n$ and parameters $\alpha$ and $Q$, and the sharp normalizing factor is defined in (3.8).

Differentiable densities are traditionally studied in the nonparametric
density estimation literature; see [24] and [37]. In the regression literature,
typical error distributions are analytic, such as normal, mixture of normals and
other stable distributions. For the case of infinite support, let us consider a
class of such distributions studied in [25]. We say that $f$ belongs to an
analytic class $\mathcal A(\gamma,Q)$, $0<\gamma<\infty$, $0<Q<\infty$, if $f(z)$,
$-\infty<z<\infty$, has a continuation into the strip
$\{z+iy: |y|\le\gamma, z\in(-\infty,\infty)\}$, $f(z+iy)$ is analytic inside this
strip, bounded up to its boundary and
$\int_{-\infty}^{\infty}(\mathrm{Re}\{f(z+i\gamma)\})^2\,dz \le Q$.
Note that this class includes, among others, normal, Student and Cauchy
densities, as well as their mixtures and analytic one-to-one transformations.
The main feature of these densities is a very fast (exponential) decrease of
the corresponding characteristic functions, namely, according to Achieser
([1], page 251),



(B.14)
$$
\int_{-\infty}^{\infty}[e^{\gamma v} + e^{-\gamma v}]^2\,|h(v)|^2\,dv \le 8\pi Q.
$$



As a result, we may expect almost parametric rates of MISE convergence. A
finite-support counterpart of this class is well known in the literature, and it
is defined (with the obvious abuse of notation) as
$\mathcal A(\gamma,Q) := \{f: \sum_j (1 + e^{2\pi\gamma j})\theta_j^2 \le Q\}$; see [14], Chapter 2.
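The exponential-decay bound for analytic densities can be illustrated on the standard normal density, for which both sides are available in closed form. The sketch below assumes the convention $h(v) = \int f(z)e^{ivz}\,dz$, so that $h(v) = e^{-v^2/2}$ and $\mathrm{Re}\,f(z+i\gamma) = (2\pi)^{-1/2}e^{(\gamma^2 - z^2)/2}\cos(\gamma z)$; the value $\gamma = 0.7$ and the helper `integrate` are arbitrary illustrative choices. Under this convention the weighted integral of $|h|^2$ equals $8\pi\int(\mathrm{Re}\,f(z+i\gamma))^2\,dz$.

```python
import math

def integrate(g, a, b, m=20000):
    """Composite Simpson rule on [a, b] with m (even) subintervals."""
    step = (b - a) / m
    total = g(a) + g(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * g(a + i * step)
    return total * step / 3

gamma = 0.7  # half-width of the analyticity strip (illustrative value)

def re_f_shifted(z):
    # Re f(z + i*gamma) for the standard normal density f.
    return math.exp((gamma ** 2 - z * z) / 2) * math.cos(gamma * z) / math.sqrt(2 * math.pi)

def weighted_h2(v):
    # [e^{gamma v} + e^{-gamma v}]^2 |h(v)|^2 with h(v) = exp(-v^2 / 2).
    return (math.exp(gamma * v) + math.exp(-gamma * v)) ** 2 * math.exp(-v * v)

lhs = integrate(weighted_h2, -15, 15)
rhs = 8 * math.pi * integrate(lambda z: re_f_shifted(z) ** 2, -15, 15)
closed_form = 2 * math.sqrt(math.pi) * (1 + math.exp(gamma ** 2))
print(lhs, rhs, closed_form)
```

The three printed numbers coincide up to quadrature error, which shows how the smoothness parameter $\gamma$ enters only through the factor $e^{\gamma^2}$ for this density.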

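Corollary B2. Consider the setting of Theorem B1, and let (B.12)
hold. Then the Pinsker oracle (EP estimate) $\hat f_P(z,Z_1^n)$ is sharp minimax
over analytic densities and

(B.15)
$$
\sup_{f\in\mathcal A(\gamma,Q)} E\int_{-\infty}^{\infty}[r_n(\mathcal A(\gamma,Q))(\hat f_P(z,Z_1^n)-f(z))]^2\,dz
= (1+o(1))\inf_{\tilde f}\sup_{f\in\mathcal A(\gamma,Q)} E\int_{-\infty}^{\infty}[r_n(\mathcal A(\gamma,Q))(\tilde f(z)-f(z))]^2\,dz
= (1+o(1)),
$$

where the infimum is taken over all estimates $\tilde f$ based on observations
$Z_1^n$ and parameters $\gamma$ and $Q$, and the sharp normalizing factor
$r_n(\mathcal A(\gamma,Q))$ is defined in (3.11).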

These results show that the EP estimate is simultaneously sharp adaptive
over the union of differentiable and analytic densities. This allows us to
conclude that the EP estimate is a feasible choice for a Pinsker oracle. Only
to be specific, in Section 3 a Pinsker oracle with $L_k = k^2$ and
$t_k = \ln^{-2}(2+k)$ is considered; note that this choice satisfies (B.12) and
$t_k < 1$.
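To make the blockwise-shrinkage idea concrete, here is a minimal sketch for the finite-support case on $[0,1]$. It is not the S-PLUS software mentioned in the abstract. It assumes a cosine basis $\varphi_j(z) = \sqrt{2}\cos(\pi j z)$, block lengths $L_k = k^2$ and thresholds $t_k = \ln^{-2}(2+k)$ (our reading of the choice quoted above), and block weights of the form $\hat\mu_k = \hat\Theta_k(\hat\Theta_k + n^{-1})^{-1}I(\hat\Theta_k > t_k n^{-1})$. The function name `ep_density_estimate`, the parameter `num_blocks` and the Beta(2, 5) test sample are hypothetical.

```python
import math
import random

def ep_density_estimate(sample, num_blocks=4):
    """Blockwise-shrinkage (EP-type) cosine-series density estimate on [0, 1]."""
    n = len(sample)
    # Assumed block lengths L_k = k^2 and thresholds t_k = ln^{-2}(2 + k).
    lengths = [k * k for k in range(1, num_blocks + 1)]
    thresholds = [math.log(2 + k) ** -2 for k in range(1, num_blocks + 1)]
    num_coef = sum(lengths)
    # Empirical cosine coefficients theta_hat_j, j = 1, ..., num_coef.
    theta_hat = [
        sum(math.sqrt(2) * math.cos(math.pi * j * z) for z in sample) / n
        for j in range(1, num_coef + 1)
    ]
    weights = []  # one shrinkage weight per coefficient, constant on each block
    start = 0
    for length, t in zip(lengths, thresholds):
        block = theta_hat[start:start + length]
        # Block energy estimate: mean squared coefficient minus the noise level 1/n.
        theta_k = sum(c * c for c in block) / length - 1.0 / n
        # Shrink the block, or drop it when the energy is below the threshold.
        mu_k = theta_k / (theta_k + 1.0 / n) if theta_k > t / n else 0.0
        weights.extend([mu_k] * length)
        start += length

    def f_hat(z):
        # The constant term is 1 because a density on [0, 1] integrates to 1.
        return 1.0 + sum(
            w * c * math.sqrt(2) * math.cos(math.pi * (j + 1) * z)
            for j, (w, c) in enumerate(zip(weights, theta_hat))
        )

    return f_hat, weights

random.seed(0)
data = [random.betavariate(2, 5) for _ in range(500)]
f_hat, weights = ep_density_estimate(data)
print([round(f_hat(z), 3) for z in (0.1, 0.3, 0.5, 0.9)])
```

Each block is either kept with a weight strictly between 0 and 1 or discarded entirely, which is the mechanism the thresholds $t_k$ control; in the regression setting of the paper the sample would be replaced by residuals.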

Now let us verify the stated results. 

Proof of Theorem B1. The assertion plainly follows from [11] for the
finite-support case. Let us consider the infinite-support case. The plan is to
follow along and employ the main parts of the proof presented in [11]; using
the same notation will help us to do this. Set
$\hat\Theta_k := L_k^{-1}\int_{B_k}|\hat h(v)|^2\,dv - n^{-1}$
and note that (B.6) can be rewritten as
$\hat\mu_k = \hat\Theta_k(\hat\Theta_k + n^{-1})^{-1}I(\hat\Theta_k > t_k n^{-1})$.
Then, using (B.2) and the Plancherel identity, we write

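$$
\int_{-\infty}^{\infty}(\hat f_P(z,Z_1^n)-f(z))^2\,dz
= \int_{-\infty}^{\infty}\Bigl[\pi^{-1}\sum_{k=1}^{K}\int_{B_k}\mathrm{Re}\{(\hat\mu_k\hat h(v)-h(v))e^{-ivz}\}\,dv\Bigr]^2 dz
+ \int_{-\infty}^{\infty}\Bigl[\pi^{-1}\sum_{k>K}\int_{0}^{\infty}\mathrm{Re}\{h(v)I(v\in B_k)e^{-ivz}\}\,dv\Bigr]^2 dz
$$
$$
= \pi^{-1}\sum_{k=1}^{K}\int_{B_k}|\hat\mu_k\hat h(v)-h(v)|^2\,dv + \pi^{-1}\sum_{k>K}\int_{B_k}|h(v)|^2\,dv.
$$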



This yields 



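(B.16)
$$
E\int_{-\infty}^{\infty}(\hat f_P(z,Z_1^n)-f(z))^2\,dz
= \pi^{-1}\sum_{k=1}^{K}E\int_{B_k}|(\mu_k\hat h(v)-h(v)) + (\hat\mu_k-\mu_k)\hat h(v)|^2\,dv
+ \pi^{-1}\sum_{k>K}L_k\Theta_k
=: \pi^{-1}\sum_{k=1}^{K}A_k + \pi^{-1}\sum_{k>K}L_k\Theta_k.
$$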

Now we evaluate a particular $A_k$, $1 \le k \le K$. Using the Cauchy inequality,
we get



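(B.17)
$$
A_k \le (1+t_k^{1/2})E\int_{B_k}|\mu_k\hat h(v)-h(v)|^2\,dv
+ (1+t_k^{-1/2})E\Bigl[(\hat\mu_k-\mu_k)^2\int_{B_k}|\hat h(v)|^2\,dv\Bigr]
=: (1+t_k^{1/2})A_{k1} + (1+t_k^{-1/2})A_{k2}.
$$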
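The Cauchy-type step used here rests on an elementary inequality with a free constant $c > 0$. The short derivation below is a sketch added for convenience; the choice $c = t_k^{1/2}$ is our reading of the garbled display and is not confirmed by the extracted text.

```latex
% For complex numbers a, b and any constant c > 0:
\[
|a+b|^2 \le |a|^2 + 2|a||b| + |b|^2 \le (1+c)|a|^2 + (1+c^{-1})|b|^2 ,
\]
% because
\[
2|a||b| = 2\,(c^{1/2}|a|)(c^{-1/2}|b|) \le c|a|^2 + c^{-1}|b|^2 .
\]
% Apply this pointwise in v with
%   a = \mu_k \hat h(v) - h(v),   b = (\hat\mu_k - \mu_k)\hat h(v),
% then integrate over B_k and take expectations:
\[
E\int_{B_k}|a+b|^2\,dv \le (1+c)\,E\int_{B_k}|a|^2\,dv + (1+c^{-1})\,E\int_{B_k}|b|^2\,dv .
\]
```

With $c \to 0$ the leading factor $1 + c$ tends to 1, which is why a vanishing threshold sequence keeps the constant in front of the oracle risk sharp, at the price of the large factor $1 + c^{-1}$ on the remainder term.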



Note that $\pi^{-1}\sum_{k=1}^{K}A_{k1} + \pi^{-1}\sum_{k>K}L_k\Theta_k = E\int_{-\infty}^{\infty}(f^*(z,Z_1^n)-f(z))^2\,dz$.
On the other hand, using (B.2), we get

(B.18)
$$
A_{k1} = E\int_{B_k}|\mu_k(\hat h(v)-h(v)) - (1-\mu_k)h(v)|^2\,dv
= \mu_k^2 E\int_{B_k}|\hat h(v)-h(v)|^2\,dv + (1-\mu_k)^2\int_{B_k}|h(v)|^2\,dv
$$
$$
= \mu_k^2 n^{-1}\int_{B_k}(1-|h(v)|^2)\,dv + (1-\mu_k)^2L_k\Theta_k
\le n^{-1}L_k\mu_k.
$$
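The last bound on $A_{k1}$ is pure algebra once the oracle weight is specified. The sketch below checks it on a grid, assuming the weight $\mu_k = \Theta_k/(\Theta_k + n^{-1})$ with $\Theta_k = L_k^{-1}\int_{B_k}|h(v)|^2\,dv$; both are defined outside this passage, so this form is an assumption, and the function name `a_k1_and_bound` is hypothetical.

```python
def a_k1_and_bound(theta, n, length):
    """Return (A_k1, n^{-1} L_k mu_k) for the assumed weight mu_k = theta / (theta + 1/n)."""
    mu = theta / (theta + 1.0 / n)
    # Variance part: mu^2 n^{-1} int_{B_k} (1 - |h|^2) dv = mu^2 n^{-1} L_k (1 - Theta_k).
    variance = mu ** 2 * length * (1.0 - theta) / n
    # Squared-bias part: (1 - mu)^2 L_k Theta_k.
    squared_bias = (1.0 - mu) ** 2 * length * theta
    return variance + squared_bias, length * mu / n

gaps = []
for n in (10, 100, 10_000):
    for length in (1, 4, 9):
        for i in range(1001):
            theta = i / 1000.0  # Theta_k lies in [0, 1] because |h(v)| <= 1
            value, bound = a_k1_and_bound(theta, n, length)
            gaps.append(bound - value)

smallest_gap = min(gaps)
print(smallest_gap)
```

Under this assumption the gap equals $n^{-1}L_k\mu_k^2\Theta_k$, so it is nonnegative and vanishes only when $\Theta_k = 0$.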



To evaluate $A_{k2}$, we note that $A_{k2} = L_kE\{(\hat\mu_k-\mu_k)^2(\hat\Theta_k + n^{-1})\}$. Thus, at
least formally, this term is identical to the same term in line (5.9) of [11].
To follow along the evaluation of $A_{k2}$ in [11], we need to verify that

2 



(B.19) 



E{@k - @kT <C[ \h{v)\dv Ll'n~\Qk + n"^) 



-1\2 



l<k<K. 



This is done by a direct calculation which is similar to the proof of Lemma 
3 in [11]; see also [20]. Then, similarly to lines (5.10)-(5.11) in [11], we get 

k=l k=l 

Combining the results in (B.16) verifies Theorem B1. □

Proof of Corollaries B1 and B2. The second asymptotic equalities
in these corollaries are established in [36]. The first asymptotic equalities
follow from Theorem B1. □